NO CODING ZONE
THE 6 DPLYR FUNCTIONS COVERED:
- group_by(): as the name implies - use this to determine what you want to group your data by
- summarise(): used in tandem with group_by() - use this to determine what to summarise within the selected groups
- mutate(): use this to create a new variable
- inner_join(): use this to join tables
- filter(): use this to filter rows
- select(): use this to select columns
END OF BACKGROUND INFO. NOW LET’S BEGIN OUR SCRIPT
NOW YOU ARE READY TO START CODING
Before we start coding - you should save the following files from the git repository to your working directory:
Most important is the csv file. This will be used in a later section.
#install.packages('dplyr')
library(ggplot2)
library(dplyr)
library(data.table)
library(plotly)
DECONSTRUCT THE CODE ABOVE:
Read install.packages() as “save a local copy of (this external CRAN package)”.
Read library() as “for this working session of R, activate (this locally saved package)”.
Both calls follow the general form function(argument(s)).
You only need to install packages once. You need to activate selected packages every session. Best practice is to begin any script with the packages you will need for the task at hand. For this exercise we need the dplyr package (no surprise) and the data.table package (explained below). The ggplot2 and plotly packages are loaded here as well because the visualization sections use them.
- install.packages('package_name'): makes a local copy of a package that resides on CRAN
- library(package_name): makes all functions contained within the selected package available during the current R working session
Base R functions are accessible without the need to load any packages
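As a small sketch of the "install once, load every session" idea, the snippet below checks which of the packages loaded in this tutorial are already installed. The names pkgs, installed and missing.pkgs are made up for this illustration, and the install.packages() call is left commented out so nothing is downloaded by accident.

```r
# Packages this tutorial loads with library()
pkgs <- c("ggplot2", "dplyr", "data.table", "plotly")

# requireNamespace() returns TRUE when a package is installed and can be loaded;
# it does not attach the package the way library() does
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing.pkgs <- pkgs[!installed]

if (length(missing.pkgs) > 0) {
  message("Not installed yet: ", paste(missing.pkgs, collapse = ", "))
  # install.packages(missing.pkgs)  # uncomment to install (only needed once)
} else {
  message("All packages are installed; the library() calls will work.")
}
```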
dplyr functions
There are several datasets that are part of base R or part of the various packages you might install. In order to make this demonstration self-contained, we will be using one of these datasets: diamonds, which ships with the ggplot2 package loaded above.
IMPORT INTERNAL DATASET
# READ IN AN INTERNAL DATASET
df.diamonds <- diamonds
IMPORT EXTERNAL CSV
Generally speaking, you will not be using internal datasets for your
projects. It is more common to import external files, such as csv files.
This section demonstrates how to import an external csv file:
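# READ IN CSV
# Option 1: read.csv() from base R
df.mapping <- read.csv('clarity_map.csv')
# Option 2: fread() from data.table, typically faster on large files.
# You only need one option; this assignment overwrites the result of Option 1.
df.mapping <- fread('clarity_map.csv')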
DECONSTRUCTING CODE ABOVE:
TRANSLATE CODE TO ENGLISH
Once you import the data, you will see your 2 new objects (df.diamonds and df.mapping) in the upper right hand panel.
NOTE: We did not have to specify a directory to export to or import from. This is because we set our working directory - and R knows to look here.
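If you do not have clarity_map.csv at hand, a stand-in can be built from the mapping table shown later in this tutorial. This is only a sketch: the column names clarity and clarity.map are assumed from how the file is used in the join and filter steps, the object names are made up, and the file is written to a temporary folder rather than your working directory.

```r
# Mapping copied from the clarity table shown later in this tutorial
df.map.demo <- data.frame(
  clarity     = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"),
  clarity.map = c("Bad", "Bad", "Medium", "Medium", "Medium", "Medium", "Good", "Good")
)

# Write to a temporary folder so nothing in your working directory is overwritten
demo.path <- file.path(tempdir(), "clarity_map.csv")
write.csv(df.map.demo, demo.path, row.names = FALSE)

# Read it back the same way the tutorial reads clarity_map.csv
df.map.check <- read.csv(demo.path)
dim(df.map.check)

# getwd() shows the folder R searches when you give a bare file name
getwd()
```

A bare file name such as 'clarity_map.csv' is resolved against the folder returned by getwd(), which is why the tutorial code does not need a full path.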
Calculate total price by clarity:
df.agg <- df.diamonds %>%
group_by(clarity) %>%
summarise(Total.Price = sum(price), Observations = n())
dim(df.agg)
## [1] 8 3
DECONSTRUCT THE CODE ABOVE
TRANSLATE THE CODE TO ENGLISH
Notice there is a new object in our upper right hand panel called df.agg. This object has 8 rows and 3 columns, matching the dim() output above. Clicking on the object for further inspection reveals the columns are clarity, Total.Price and Observations, as expected.
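To see the group_by() and summarise() pairing on something small enough to check by hand, here is a sketch on a made-up three-row data frame (df.toy and its values are hypothetical, not part of the diamonds data).

```r
library(dplyr)

# Toy data (hypothetical values): three diamonds, two clarity levels
df.toy <- data.frame(
  clarity = c("IF", "IF", "SI1"),
  price   = c(100, 300, 50),
  stringsAsFactors = FALSE
)

df.toy.agg <- df.toy %>%
  group_by(clarity) %>%                                    # one group per clarity level
  summarise(Total.Price = sum(price), Observations = n())  # one output row per group

df.toy.agg
dim(df.toy.agg)   # 2 rows (one per clarity level) and 3 columns
```

The result has one row per group, which is why the diamonds version collapses to 8 rows: one per clarity level.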
df.agg <- df.agg %>%
mutate(average.price = Total.Price/Observations)
dim(df.agg)
## [1] 8 4
DECONSTRUCT THE CODE ABOVE
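One way to deconstruct the mutate() step: it adds a column computed row by row from columns that already exist. The sketch below uses a made-up data frame and an extra check.mean column (both hypothetical) to confirm that Total.Price divided by Observations is the same as taking mean(price) within each group.

```r
library(dplyr)

# Toy data (hypothetical values)
df.toy <- data.frame(
  clarity = c("IF", "IF", "SI1"),
  price   = c(100, 300, 50),
  stringsAsFactors = FALSE
)

df.toy.avg <- df.toy %>%
  group_by(clarity) %>%
  summarise(Total.Price = sum(price),
            Observations = n(),
            check.mean = mean(price)) %>%               # direct group mean, for comparison only
  mutate(average.price = Total.Price / Observations)    # new column built from existing columns

# The mutate() result matches the mean computed directly within each group
all.equal(df.toy.avg$average.price, df.toy.avg$check.mean)
```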
Map the 8 clarity levels to 3 categories as follows:
| clarity | clarity.map |
|---|---|
| I1 | Bad |
| SI2 | Bad |
| SI1 | Medium |
| VS2 | Medium |
| VS1 | Medium |
| VVS2 | Medium |
| VVS1 | Good |
| IF | Good |
Use the following snippet:
df.agg <- df.agg %>%
inner_join(df.mapping, by = 'clarity')
dim(df.agg)
## [1] 8 5
DECONSTRUCT THE CODE ABOVE:
TRANSLATE CODE TO ENGLISH:
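In plain terms, inner_join() keeps only the rows whose key appears in both tables and adds the columns of the second table. In the tutorial every clarity level has a match, so the row count stays at 8. The sketch below uses two made-up tables (df.left and df.map are hypothetical) where one key has no match, and contrasts the result with left_join(), which the tutorial does not use.

```r
library(dplyr)

# Hypothetical toy tables: "XX" has no entry in the mapping table
df.left <- data.frame(clarity = c("I1", "IF", "XX"),
                      Total.Price = c(10, 20, 30),
                      stringsAsFactors = FALSE)
df.map  <- data.frame(clarity = c("I1", "IF"),
                      clarity.map = c("Bad", "Good"),
                      stringsAsFactors = FALSE)

df.inner <- df.left %>% inner_join(df.map, by = "clarity")  # matched rows only
df.leftj <- df.left %>% left_join(df.map, by = "clarity")   # all left rows, NA where unmatched

nrow(df.inner)
nrow(df.leftj)
```

Comparing row counts before and after a join, as the tutorial does with dim(), is a quick way to notice keys that failed to match.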
As with everything, there is more than one way to accomplish this task. The snippet below demonstrates an alternative method using ifelse(). The end result is identical to the inner_join() method, so both methods are correct. A mapping table makes sense when you have a lot of levels; ifelse() is easier when you have just a few. Ultimately, this is a personal choice that depends on the individual and the project.
df.agg <- df.agg %>%
mutate(clarity.map2 = ifelse(clarity %in% c('I1','SI2'),'Bad',
ifelse(clarity %in% c('SI1','VS2','VS1','VVS2'),'Medium',
ifelse(clarity %in% c('VVS1','IF'),'Good','Undefined'))))
DECONSTRUCT THE CODE ABOVE:
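The nested ifelse() calls can be read as an ordered list of rules: the first condition that is TRUE decides the label, and the last value is the fallback. As a sketch of that reading, the same rules are written below with dplyr's case_when(), which is not used in the original tutorial. The data frame df.levels and the column name clarity.map3 are made up for this illustration.

```r
library(dplyr)

# The 8 clarity levels from the mapping table
df.levels <- data.frame(
  clarity = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"),
  stringsAsFactors = FALSE
)

df.levels <- df.levels %>%
  mutate(clarity.map3 = case_when(
    clarity %in% c("I1", "SI2")                 ~ "Bad",
    clarity %in% c("SI1", "VS2", "VS1", "VVS2") ~ "Medium",
    clarity %in% c("VVS1", "IF")                ~ "Good",
    TRUE                                        ~ "Undefined"  # fallback, like the last ifelse() value
  ))

table(df.levels$clarity.map3)
```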
VERSION 2: (not coded yet - original method in R for Actuaries):
TRANSLATE CODE TO ENGLISH:
Only look at diamonds mapped to “Good”.
df.agg <- df.agg %>%
filter(clarity.map == 'Good')
DECONSTRUCT THE CODE ABOVE:
TRANSLATE THE CODE TO ENGLISH:
filter() returns at most the original number of rows. It does not change the number of columns.
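The row and column behaviour of filter() can be checked on a small made-up table (df.toy and its values are hypothetical):

```r
library(dplyr)

# Toy data (hypothetical values)
df.toy <- data.frame(
  clarity     = c("I1", "VS1", "VVS1", "IF"),
  clarity.map = c("Bad", "Medium", "Good", "Good"),
  stringsAsFactors = FALSE
)

df.good <- df.toy %>%
  filter(clarity.map == "Good")   # keep only rows where the condition is TRUE

nrow(df.toy); nrow(df.good)   # the row count can stay the same or shrink, never grow
ncol(df.toy); ncol(df.good)   # the column count is unchanged
```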
Remove the column “clarity.map2”
df.agg1 <- df.agg %>%
select(c('clarity', 'Total.Price', 'Observations', 'average.price', 'clarity.map'))
DECONSTRUCT THE CODE ABOVE:
TRANSLATE CODE TO ENGLISH:
ALTERNATE METHOD FOR TASK 7:
df.agg2 <- df.agg %>%
select(-c('clarity.map2'))
df.agg <- df.agg1
rm(df.agg1, df.agg2)
DECONSTRUCT THE CODE ABOVE:
TRANSLATE THE CODE TO ENGLISH:
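Both select() styles say the same thing in different ways: either name the columns to keep, or name the columns to drop with a minus sign. The sketch below checks on a made-up table (df.toy is hypothetical) that the two styles give the same result.

```r
library(dplyr)

# Toy data (hypothetical values)
df.toy <- data.frame(
  clarity      = c("VVS1", "IF"),
  clarity.map  = c("Good", "Good"),
  clarity.map2 = c("Good", "Good"),
  stringsAsFactors = FALSE
)

keep.cols <- df.toy %>% select(clarity, clarity.map)   # name the columns to keep
drop.cols <- df.toy %>% select(-clarity.map2)          # or name the column to drop

identical(keep.cols, drop.cols)
```

Naming the columns to keep is safer when the input may gain columns later; dropping by name is shorter when only one or two columns need to go.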
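We have now taken the diamonds dataset THEN grouped it by clarity THEN summarised total price and number of observations THEN calculated the average price THEN joined on the clarity mapping THEN rebuilt the same mapping with ifelse() THEN filtered to the “Good” category THEN selected the columns to keep.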
This was all done sequentially on the same dataset. This is a good representation of how code is created: piecemeal and iteratively. It is very rare that you go into a project knowing every task you need to complete, in order. Ultimately, however, you do get your final result, as in our simplified example above. At this point, you can use the pipe operator to clean up and chain your code together as follows:
df.chained <- df.diamonds %>%
group_by(clarity) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations) %>%
inner_join(df.mapping, by = 'clarity') %>%
mutate(clarity.map2 = ifelse(clarity %in% c('I1','SI2'),'Bad',
ifelse(clarity %in% c('SI1','VS2','VS1','VVS2'),'Medium',
ifelse(clarity %in% c('VVS1','IF'),'Good','Undefined')))) %>%
filter(clarity.map == 'Good') %>%
select(c('clarity', 'average.price'))
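The pipe operator %>% passes the result on its left as the first argument of the function on its right, which is why a chain reads as "take this THEN do that". A minimal sketch on a made-up data frame (df.toy is hypothetical) comparing the nested and piped forms:

```r
library(dplyr)

# Toy data (hypothetical values)
df.toy <- data.frame(price = c(50, 150, 250))

# Nested form: read from the inside out
n.nested <- nrow(filter(df.toy, price > 100))

# Piped form: read left to right, "take df.toy THEN filter THEN count rows"
n.piped <- df.toy %>%
  filter(price > 100) %>%
  nrow()

n.nested == n.piped
```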
Initial Data Prep
Before we begin our data visualization - let’s do some data prep using
the tools above. First, create 2 new variables: carat.group, and
table.group. Then we will create 4 aggregated datasets as follows:
# Create new variable carat.group
df.diamonds <- df.diamonds %>%
mutate(carat.group = ifelse(carat < 1, 0.5,
ifelse(carat < 2, 1.5,
ifelse(carat < 3, 2.5,
ifelse(carat < 4, 3.5, 4.5)))))
# create new variable table.group
df.diamonds <- df.diamonds %>%
mutate(table.group = ifelse(table < 60, 60,
ifelse(table < 70, 70, 95)))
# Create dataset for 1 variable visualizations
df.agg1 <- df.diamonds %>%
group_by(carat.group) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations)
# Create dataset for 2 variable visualizations
df.agg2 <- df.diamonds %>%
group_by(carat.group, cut) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations)
# Create dataset for 3 variable visualizations
df.agg3 <- df.diamonds %>%
group_by(carat.group, cut, color) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations)
# Create dataset for 4 variable visualizations
df.agg4 <- df.diamonds %>%
group_by(carat.group, cut, color, table.group) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations)
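The nested ifelse() used for carat.group above assigns each carat value the midpoint of its band. As a sketch of an alternative that is not part of the original tutorial, base R's findInterval() can do the same lookup. The vectors carat, breaks and band.labels below are made up for this illustration, with boundary values included to check that both versions agree.

```r
# Hypothetical carat values, including the boundary cases 1, 2, 3 and 4
carat <- c(0.3, 1, 1.7, 2, 3.2, 4, 5.01)

# Nested ifelse(), as in the tutorial
group.ifelse <- ifelse(carat < 1, 0.5,
                ifelse(carat < 2, 1.5,
                ifelse(carat < 3, 2.5,
                ifelse(carat < 4, 3.5, 4.5))))

# findInterval() counts how many break points are <= each value (0 to 4 here),
# so adding 1 gives the position of the matching band label
breaks      <- c(1, 2, 3, 4)
band.labels <- c(0.5, 1.5, 2.5, 3.5, 4.5)
group.lookup <- band.labels[findInterval(carat, breaks) + 1]

identical(group.ifelse, group.lookup)
```

With only a few bands the nested ifelse() is easy to read; the lookup form tends to pay off when the number of bands grows.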
Basic GGPLOT2 GRAPH:
Graphs created with ggplot2 use the following form:
ggplot(data, aes(x, y)) + geom_XXXX()
DECONSTRUCT THE LINE ABOVE:
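The form has two parts: ggplot(data, aes(x, y)) declares which dataset and which columns map to the axes, and each geom_XXXX() layer added with + says how to draw them. A minimal sketch on a made-up data frame (df.toy and its values are hypothetical stand-ins for an aggregated dataset):

```r
library(ggplot2)

# Hypothetical toy data standing in for an aggregated dataset
df.toy <- data.frame(
  carat.group   = c(0.5, 1.5, 2.5),
  average.price = c(1500, 6000, 12000)
)

# ggplot(data, aes(x, y)) sets up the data and axes; on its own it draws no points or lines
p.base <- ggplot(df.toy, aes(x = carat.group, y = average.price))

# Adding a geom_*() layer tells ggplot2 how to draw the data
p.points <- p.base + geom_point()

# Layers stack: the same base object can carry several geometries
p.both <- p.base + geom_point() + geom_line()

p.both
```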
Begin by creating your ggplot object as described above:
graph.object <- ggplot(df.agg1, aes(x = carat.group, y = average.price, group = 1))
graph.object
scatter1 <- graph.object + geom_point()
line1 <- graph.object + geom_line()
bar1 <- graph.object + geom_bar(stat = "identity")
scatterline1 <- scatter1 + geom_line()
scatter1
bar1
line1
scatterline1
As always with ggplot2 - our first step is to create our ggplot object.
graph.task2 <- ggplot(df.agg2,aes(x = carat.group, y = average.price, color = cut, fill = cut))
Once we build our ggplot object, adding geometries is the same as in Task 1:
scatter2 <- graph.task2 + geom_point()
line2 <- graph.task2 + geom_line()
stackbar2 <- graph.task2 + geom_bar(stat = "identity",position = "stack")
fillbar2 <- graph.task2 + geom_bar(stat = "identity",position = "fill")
dodgebar2 <- graph.task2 + geom_bar(stat = "identity",position = "dodge")
scatterline2 <- scatter2 + geom_line()
scatter2
line2
stackbar2
fillbar2
dodgebar2
scatterline2
There is another powerful layer you can add to your graphs which allows you to analyze higher order dimensions (more variables). This is called faceting. Faceting means making copies of your graphs. We are going to make a bunch of copies of the graphs we built up to now. Each copy will represent a single level of our designated faceting variable(s).
Once again - start with your ggplot object:
graph.task3 <- ggplot(df.agg3,aes(x = carat.group, y = average.price, color = cut, fill = cut))
facet_wrap()
We will start with facet_wrap(). facet_wrap() allows faceting (making copies of your graph) split by 1 variable.
Next - add our geometries like in previous tasks. We also add an additional facet_wrap() layer to our ggplot object. Adding the facet_wrap() layer to our graphs above simply results in several copies of your prior graphs, each one filtered on a single level of your faceting variable.
scatterline3 <- graph.task3 + geom_line() + geom_point() + facet_wrap(~color)
scatterline3
facet_grid()
facet_grid() is very similar to facet_wrap(). Where facet_wrap() allows you to facet on one variable, facet_grid() allows you to facet on 2 variables: one variable for the rows, and one for the columns. In the formula rows ~ columns, the variable on the left defines the rows.
graph.task4 <- ggplot(df.agg4, aes(x = carat.group, y = average.price, color = cut, fill = cut))
scatterline4 <- graph.task4 + geom_line() + geom_point() + facet_grid(color ~ table.group) + xlab("Carat Group") + ylab("Average Price") + ggtitle("Average Price by Carat Group, Cut, Color and Table Group")
scatterline4
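To compare the two faceting functions side by side, the sketch below builds a small made-up dataset (df.toy is hypothetical) with 2 levels of color and 2 levels of table.group, so the panel counts are easy to predict.

```r
library(ggplot2)

# Hypothetical toy data: 2 carat groups x 2 colors x 2 table groups = 8 rows
df.toy <- expand.grid(
  carat.group = c(0.5, 1.5),
  color       = c("D", "E"),
  table.group = c(60, 70)
)
df.toy$average.price <- seq(1000, by = 500, length.out = nrow(df.toy))

p.base <- ggplot(df.toy, aes(x = carat.group, y = average.price)) + geom_point()

# facet_wrap(): one panel per level of color (2 panels)
p.wrap <- p.base + facet_wrap(~color)

# facet_grid(): rows by color, columns by table.group (2 x 2 = 4 panels)
p.grid <- p.base + facet_grid(color ~ table.group)

p.wrap
p.grid
```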
Plotly is a wrapper you can put around your ggplot graphs. This wrapper greatly enhances ggplot graphs and is extremely simple to implement. For these reasons - I use it as a default with all my graphs.
In order to use plotly, simply add the ggplotly() function around any of the graph objects we have already created. This will create cleaner graphs as well as additional interactivity:
ggplotly(scatter1)
ggplotly(scatter2)
ggplotly(bar1)
ggplotly(stackbar2)
ggplotly(fillbar2)
ggplotly(dodgebar2)
ggplotly(line1)
ggplotly(line2)
ggplotly(scatterline1)
ggplotly(scatterline2)
ggplotly(scatterline3)
ggplotly(scatterline4)
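For a version that runs on its own, without the objects built earlier, the sketch below wraps a small made-up plot (df.toy and its values are hypothetical) in ggplotly() and inspects what comes back.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data
df.toy <- data.frame(
  carat.group   = c(0.5, 1.5, 2.5),
  average.price = c(1500, 6000, 12000)
)

p.static <- ggplot(df.toy, aes(x = carat.group, y = average.price)) +
  geom_point() +
  geom_line()

# ggplotly() converts the static ggplot object into an interactive htmlwidget
p.interactive <- ggplotly(p.static)

class(p.interactive)
```

Printing p.interactive in an interactive session opens the plot in the viewer, where hovering over a point shows its values.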